
[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898) #12802


Open
wants to merge 2 commits into base: main

Conversation


@StanHatko StanHatko commented Feb 6, 2025

Fixes nondeterminism from the sampler that occurs even at zero temperature. The patch supports both zero and nonzero temperatures for different entries in the same batch, applying the correct operation in each case. It does this by multiplying by an is-zero indicator for the zero-temperature calculation and an is-nonzero indicator for the positive-temperature case, then summing the two. For any entry, only one term in the sum is nonzero because the two cases are mutually exclusive.
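
For illustration, here is a minimal sketch of that indicator-sum idea in PyTorch. This is not the exact code from the patch; the function name, the 1e9/-1e9 constants for the greedy rows, and the clamp guarding the division are assumptions based on the description above and the review discussion below.

    import torch

    def scale_by_temperature(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
        # logits: (batch, vocab), temperatures: (batch,)
        t = temperatures.unsqueeze(dim=1)
        is_zero = (t == 0).to(logits.dtype)     # 1.0 on zero-temperature (greedy) rows
        is_nonzero = 1.0 - is_zero              # 1.0 on positive-temperature rows

        # Greedy rows: give the argmax token an overwhelmingly large logit so the
        # downstream softmax/sampling step is effectively deterministic.
        greedy_logits = torch.where(
            logits == logits.max(dim=1, keepdim=True).values,
            torch.full_like(logits, 1e9),
            torch.full_like(logits, -1e9),
        )
        # Sampled rows: the usual division by temperature (clamped to avoid divide-by-zero).
        sampled_logits = logits / t.clamp(min=1e-6)

        # The indicators are mutually exclusive, so each row keeps exactly one of the two terms.
        return is_zero * greedy_logits + is_nonzero * sampled_logits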

I created this patch and discussed it in #5404 (comment). I used this patch internally and it greatly reduced variation in answers at zero temperature (the remaining nondeterminism is, I think, due to CUDA nondeterminism in the steps leading up to the calculation of the logits).

Please feel free to make any improvements to this patch you can think of.

FIX #5404 (patch discussed there).

FIX #5898 (the sampler part; nondeterminism from the computation of the logits themselves is not covered here).

supports both zero and nonzero temperatures in batch
see vllm-project#5404 (comment)

Signed-off-by: Stan Hatko <stan_hatko@live.com>

github-actions bot commented Feb 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Stan Hatko <stan_hatko@live.com>
Member

@ywang96 ywang96 left a comment


Very much appreciated the investigation! IMO this is quite an elegant patch - @WoosukKwon What do you think?

@StanHatko
Author

For the pre-commit hook error here, it shows missing SPDX headers in the file tests/lora/test_ultravox.py, which I didn't modify at all in my commits. The pre-commit hooks passed on my computer, so I think this comes from unrelated commits made to vLLM since then.

For the test with increased GPU VRAM usage, some increase makes sense due to the creation of additional tensors. The previous code just did an in-place divide after the conversion to float32, while the current code needs to compute tensors for both cases and sum them. I think correctness of outputs at zero temperature (a very common use case) is more important than saving GPU VRAM at the sampling step, though. It may also be possible to optimize the code to do more work in-place, but I think some new tensors are inevitable with this approach.
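
For reference, the old in-place path described above probably looked roughly like this (an assumption; the exact vLLM code may differ):

    # Convert once, then divide in place: no additional full-size tensors are allocated.
    logits = logits.to(torch.float32)
    logits.div_(temperatures.unsqueeze(dim=1))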

However, it's possible to avoid the increased GPU RAM usage with a custom CUDA kernel. Basically re-write this as a CUDA kernel and call that. No new tensors will be needed as the computations will be done in temporary variables within a loop, with the final output values assigned directly to the output tensor. I think we should first merge this though and then implement the CUDA kernel as an optimization. I have no experience writing CUDA kernels (though I did write and optimize OpenCL code many years ago).

@WoosukKwon
Collaborator

Wow this is amazing! Thanks for the thorough investigation!

@WoosukKwon
Collaborator

Sorry for the delay in review. LGTM overall, but I'd like to understand the problem in more detail. Let me get back to this within 1~2 days.

@njhill
Member

njhill commented Feb 14, 2025

@StanHatko thank you for the investigation.

To me this looks like a workaround for an underlying problem: doing argmax on the logprobs for greedy sampling.

Could you try out this PR which fixes that? #13312

At a minimum I think this should avoid the need for the * 1e9 * (logits == logits.max(dim=1, keepdim=True)[0]) part, i.e. instead just keep

   logits_z = is_zero * logits

Man I forgot how convoluted our v0 sampler code was 😅

@StanHatko
Author

Yes, fixing the sampler to be deterministic in the temperature = 0 case seems like a better solution; by just taking the largest logit in that case, everything here can be avoided. I'll check if that pull request fixes the issue.

@StanHatko
Author

Unfortunately I'm having trouble reproducing some of the old non-deterministic runs from before. There have been a bunch of changes since, and the prompts were long, complex, and contained data that can't be released publicly. With simple prompts I didn't see the non-determinism in some basic tests. If anyone has a minimal working example, please post it.

Some throughput optimizations I enabled (in particular prefix caching, sorting prompts before feeding them in to take advantage of prefix caching, chunked prefill, and CUDA graph optimization) together seemed to reduce the amount of non-determinism: the number of unique responses for 5 different queries, each repeated 100 times, went down from 3, 4, 1, 2, 3 to 1, 2, 1, 1, 2.

The new PR looks good. I agree the sampler itself is the better place to make the changes; the reason I didn't do that myself is that I wasn't familiar with the sampler code but understood the logits and logprobs code. For greedy sampling, is it guaranteed the flag will be set for zero-temperature entries? If there's a batch with a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?

@njhill
Member

njhill commented Feb 17, 2025

Thanks @StanHatko

If there's a batch with a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?

@StanHatko yes that's correct, and greedy requests are determined by temperature < _SAMPLING_EPS.
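
A hypothetical sketch of that split (not vLLM's actual sampler code; the function, the example _SAMPLING_EPS value, and the clamp are illustration-only assumptions):

    import torch

    _SAMPLING_EPS = 1e-5  # example threshold; vLLM defines its own constant

    def sample_mixed_batch(logits: torch.Tensor, temperatures: torch.Tensor) -> torch.Tensor:
        # Greedy for rows with temperature < _SAMPLING_EPS, regular sampling for the rest.
        greedy_mask = temperatures < _SAMPLING_EPS                    # (batch,)
        probs = torch.softmax(
            logits / temperatures.clamp(min=_SAMPLING_EPS).unsqueeze(dim=1), dim=-1
        )
        sampled = torch.multinomial(probs, num_samples=1).squeeze(1)  # random sampling
        greedy = logits.argmax(dim=-1)                                # deterministic argmax
        return torch.where(greedy_mask, greedy, sampled)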


mergify bot commented May 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @StanHatko.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 11, 2025
Development

Successfully merging this pull request may close these issues.

[Bug]: Inconsistent Responses with VLLM When Batch Size > 1 even temperature = 0
[Bug]: topk=1 and temperature=0 cause different output in vllm
5 participants